AITopics | online exploration

Collaborating Authors

online exploration

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption

Neural Information Processing SystemsJun-14-2026, 05:47:06 GMT

Pretraining a policy on offline data followed by fine-tuning through online interactions, known as Offline-to-Online Reinforcement Learning (O2O RL), has emerged as a promising paradigm for real-world RL deployment. However, both offline datasets and online interactions in practical environments are often noisy or even maliciously corrupted, severely degrading the performance of O2O RL. Existing works primarily focus on mitigating the conservatism of offline policies via online exploration, while the robustness of O2O RL under data corruption, including states, actions, rewards, and dynamics, is still unexplored. In this work, we observe that data corruption induces heavy-tailed behavior in the policy, thereby substantially degrading the efficiency of online exploration. To address this issue, we incorporate Inverse Probability Weighted (IPW) into the online exploration policy to alleviate heavy-tailedness, and propose a novel, simple yet effective method termed $\textbf{RPEX}$: $\textbf{R}$obust $\textbf{P}$olicy $\textbf{EX}$pansion. Extensive experimental results on D4RL datasets demonstrate that RPEX achieves SOTA O2O performance across a wide range of data corruption scenarios.

artificial intelligence, machine learning, reinforcement learning, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.60)

Add feedback

Hybrid Reinforcement Learning Breaks Sample Size Barriers in Linear MDPs Kevin Tan, Wei Fan, Y uting Wei Department of Statistics and Data Science The Wharton School, University of Pennsylvania

Neural Information Processing SystemsFeb-18-2026, 08:25:09 GMT

Hybrid Reinforcement Learning (RL), where an agent learns from both an offline dataset and online explorations in an unknown environment, has garnered significant recent interest. A crucial question posed by Xie et al. (2022b) is whether hybrid RL can improve upon the existing lower bounds established for purely of-fline or online RL without requiring that the behavior policy visit every state and action the optimal policy does. While Li et al. (2023b) provided an affirmative answer for tabular P AC RL, the question remains unsettled for both the regret-minimizing and non-tabular cases. In this work, building upon recent advancements in offline RL and reward-agnostic exploration, we develop computationally efficient algorithms for both P AC and regret-minimizing RL with linear function approximation, without requiring concentrability on the entire state-action space. We demonstrate that these algorithms achieve sharper error or regret bounds that are no worse than, and can improve on, the optimal sample complexity in offline RL (the first algorithm, for P AC RL) and online RL (the second algorithm, for regret-minimizing RL) in linear Markov decision processes (MDPs), regardless of the quality of the behavior policy. To our knowledge, this work establishes the tightest theoretical guarantees currently available for hybrid RL in linear MDPs.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania (0.40)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.92)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Hybrid Reinforcement Learning Breaks Sample Size Barriers in Linear MDPs Kevin Tan, Wei Fan, Y uting Wei Department of Statistics and Data Science The Wharton School, University of Pennsylvania

Neural Information Processing SystemsOct-10-2025, 18:25:27 GMT

algorithm, hybrid rl, partition, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania (0.40)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.92)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration

Bose, Avinandan, Xiong, Zhihan, Saha, Aadirupa, Du, Simon Shaolei, Fazel, Maryam

arXiv.org Artificial IntelligenceDec-13-2024

Reinforcement Learning from Human Feedback (RLHF) stands out as the primary method for aligning large language models with human preferences (Bai et al., 2022; Christiano et al., 2017; Ouyang et al., 2022). Instead of starting from scratch with unsupervised training on extensive datasets, RLHF aligns pre-trained models using labeled human preferences on pairs of responses, offering a statistically lightweight approach to making language models more human-like. While labeling response pairs is easier than generating new responses, the volume of these pairs is critical for effective alignment. A large dataset is needed to ensure broad coverage of linguistic nuances, reduce the impact of noisy human feedback, and provide enough statistical power for the model to generalize well. Although labeling individual pairs is simpler, scaling this process can still become resource-intensive, making the volume of response pairs a key factor in successful model alignment. In the light of this, recently a theoretical question of interest has arisen: How can algorithms be designed to be sample-efficient during this alignment phase? Two main approaches have emerged in addressing this question: online RLHF and offline RLHF. Online methods (Cen et al., 2024; Xie et al., 2024; Zhang et al., 2024) have interactive access to human feedback or leverage a more powerful language model to explore diverse and novel responses beyond what the pre-trained model can provide.

artificial intelligence, arxiv preprint arxiv, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2412.10616

Country:

Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Hybrid Reinforcement Learning Breaks Sample Size Barriers in Linear MDPs

Tan, Kevin, Fan, Wei, Wei, Yuting

arXiv.org Artificial IntelligenceAug-8-2024

Hybrid Reinforcement Learning (RL), where an agent learns from both an offline dataset and online explorations in an unknown environment, has garnered significant recent interest. A crucial question posed by Xie et al. (2022) is whether hybrid RL can improve upon the existing lower bounds established in purely offline and purely online RL without relying on the single-policy concentrability assumption. While Li et al. (2023) provided an affirmative answer to this question in the tabular PAC RL case, the question remains unsettled for both the regret-minimizing RL case and the non-tabular case. In this work, building upon recent advancements in offline RL and reward-agnostic exploration, we develop computationally efficient algorithms for both PAC and regret-minimizing RL with linear function approximation, without single-policy concentrability. We demonstrate that these algorithms achieve sharper error or regret bounds that are no worse than, and can improve on, the optimal sample complexity in offline RL (the first algorithm, for PAC RL) and online RL (the second algorithm, for regret-minimizing RL) in linear Markov decision processes (MDPs), regardless of the quality of the behavior policy. To our knowledge, this work establishes the tightest theoretical guarantees currently available for hybrid RL in linear MDPs.

algorithm, partition, wagenmaker and pacchiano, (13 more...)

arXiv.org Artificial Intelligence

2408.04526

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Pennsylvania (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

KOI: Accelerating Online Imitation Learning via Hybrid Key-state Guidance

Lu, Jingxian, Xia, Wenke, Wang, Dong, Wang, Zhigang, Zhao, Bin, Hu, Di, Li, Xuelong

arXiv.org Artificial IntelligenceAug-8-2024

Online Imitation Learning methods struggle with the gap between extensive online exploration space and limited expert trajectories, which hinder efficient exploration due to inaccurate task-aware reward estimation. Inspired by the findings from cognitive neuroscience that task decomposition could facilitate cognitive processing for efficient learning, we hypothesize that an agent could estimate precise task-aware imitation rewards for efficient online exploration by decomposing the target task into the objectives of "what to do" and the mechanisms of "how to do". In this work, we introduce the hybrid Key-state guided Online Imitation (KOI) learning approach, which leverages the integration of semantic and motion key states as guidance for task-aware reward estimation. Initially, we utilize the visual-language models to segment the expert trajectory into semantic key states, indicating the objectives of "what to do". Within the intervals between semantic key states, optical flow is employed to capture motion key states to understand the process of "how to do". By integrating a thorough grasp of both semantic and motion key states, we refine the trajectory-matching reward computation, encouraging task-aware exploration for efficient online imitation learning. Our experiment results prove that our method is more sample efficient in the Meta-World and LIBERO environments. We also conduct real-world robotic manipulation experiments to validate the efficacy of our method, demonstrating the practical applicability of our KOI method.

imitation learning, key state, learning, (12 more...)

arXiv.org Artificial Intelligence

2408.02912

Country:

Asia > China > Shanghai > Shanghai (0.04)
Europe > Sweden > Halland County > Halmstad (0.04)

Genre:

Research Report > New Finding (0.88)
Instructional Material > Online (0.86)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.54)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)

Add feedback

Probabilistic Counterexample Guidance for Safer Reinforcement Learning (Extended Version)

Ji, Xiaotong, Filieri, Antonio

arXiv.org Artificial IntelligenceJul-12-2023

Safe exploration aims at addressing the limitations of Reinforcement Learning (RL) in safety-critical scenarios, where failures during trial-and-error learning may incur high costs. Several methods exist to incorporate external knowledge or to use proximal sensor data to limit the exploration of unsafe states. However, reducing exploration risks in unknown environments, where an agent must discover safety threats during exploration, remains challenging. In this paper, we target the problem of safe exploration by guiding the training with counterexamples of the safety requirement. Our method abstracts both continuous and discrete state-space systems into compact abstract models representing the safety-relevant knowledge acquired by the agent during exploration. We then exploit probabilistic counterexample generation to construct minimal simulation submodels eliciting safety requirement violations, where the agent can efficiently train offline to refine its policy towards minimising the risk of safety violations during the subsequent online exploration. We demonstrate our method's effectiveness in reducing safety violations during online exploration in preliminary experiments by an average of 40.3% compared with QL and DQN standard algorithms and 29.1% compared with previous related work, while achieving comparable cumulative rewards with respect to unrestricted exploration and alternative approaches.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

2307.04927

Country:

Europe > United Kingdom (0.04)
Europe > Poland > Masovia Province > Warsaw (0.04)

Genre: Research Report (0.50)

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support

Tran-The, Hung, Gupta, Sunil, Nguyen-Tang, Thanh, Rana, Santu, Venkatesh, Svetha

arXiv.org Machine LearningJul-24-2021

We address policy learning with logged data in contextual bandits. Current offline-policy learning algorithms are mostly based on inverse propensity score (IPS) weighting requiring the logging policy to have \emph{full support} i.e. a non-zero probability for any context/action of the evaluation policy. However, many real-world systems do not guarantee such logging policies, especially when the action space is large and many actions have poor or missing rewards. With such \emph{support deficiency}, the offline learning fails to find optimal policies. We propose a novel approach that uses a hybrid of offline learning with online exploration. The online exploration is used to explore unsupported actions in the logged data whilst offline learning is used to exploit supported actions from the logged data avoiding unnecessary explorations. Our approach determines an optimal policy with theoretical guarantees using the minimal number of online explorations. We demonstrate our algorithms' effectiveness empirically on a diverse collection of datasets.

algorithm, online exploration, unsupported action, (14 more...)

arXiv.org Machine Learning

2107.11533

Country:

North America > United States > New York > New York County > New York City (0.04)
Oceania > Australia (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
(6 more...)

Genre: Research Report (0.70)

Industry: Education > Educational Setting > Online (0.41)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.41)

Add feedback